[SPARK-56912][SQL] Simplify Cast to boolean codegen under ANSI mode#55937
[SPARK-56912][SQL] Simplify Cast to boolean codegen under ANSI mode#55937gengliangwang wants to merge 2 commits into
Conversation
5bb514b to
5a88ec7
Compare
5a88ec7 to
aee86c1
Compare
|
Audited this PR for the same lessons surfaced by @viirya and @cloud-fan on #55938 (and applied to #55934 / #55939):
So no changes needed on this PR for that review. |
aee86c1 to
365cabe
Compare
### What changes were proposed in this pull request? Extend `CastUtils.java` with `stringToBooleanExact(UTF8String, QueryContext)` and use it from `Cast.scala` for the ANSI `String -> Boolean` cast path (both eval and codegen). The non-ANSI path keeps the inline `if/else if/else evNull = true` form because it has no error to throw. ### Why are the changes needed? Part of SPARK-56908 (umbrella). The ANSI String->Boolean cast emits an 8-line `if (isTrueString) … else if (isFalseString) … else throw` block in codegen. This PR collapses it to a one-line `CastUtils .stringToBooleanExact(...)` call. ### Does this PR introduce _any_ user-facing change? No. The compiled behavior is identical; only the emitted Java source text changes. ### How was this patch tested? ``` build/sbt "catalyst/testOnly *CastSuite *CastWithAnsiOnSuite \ *AnsiCastSuite *TryCastSuite" ``` 204/204 pass. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Cursor 1.x
365cabe to
b186a2a
Compare
cloud-fan
left a comment
There was a problem hiding this comment.
This PR is part of the SPARK-56908 codegen-simplification stack (described in your first conversation comment). The ANSI String→Boolean cast previously emitted an 8-line if/else if/else throw block in codegen and a parallel block in the eval path. The PR collapses both into a single helper call.
Prior state. castToBoolean (eval) and castToBooleanCode (codegen) each had a single _: StringType case that branched on ansiEnabled to choose between throwing invalidInputSyntaxForBooleanError and assigning null / evNull = true. The ANSI throw path duplicated the helper-method machinery from peer ANSI casts.
Design approach. Add a stringToBooleanExact(UTF8String, QueryContext) helper that performs the isTrueString → isFalseString → throw sequence and call it from a new case _: StringType if ansiEnabled branch in both castToBoolean and castToBooleanCode. The non-ANSI branch keeps the inline if/else if/else because there's no error to throw — same separation that peer casts (castToByteCode, castToIntCode, etc.) already use.
Key design decision — where the helper lives. The PR places it in CastUtils.java next to the integral/fractional narrowing helpers, but the peer string→primitive ANSI helpers (toByteExact / toShortExact / toIntExact / toLongExact) live in UTF8StringUtils.scala. See the inline comment recommending the move — it preserves the string-source/numeric-source split that UTF8StringUtils and CastUtils already maintain, and keeps CastUtils's class Javadoc accurate.
Implementation sketch. Behavior is preserved across both eval and codegen paths: the helper executes the same isTrueString / isFalseString / throw sequence; the non-ANSI inline branch keeps the same isTrueString / isFalseString / null sequence. Existing test coverage (CastSuite / CastWithAnsiOnSuite / CastWithAnsiOffSuite / TryCastSuite, plus the postgreSQL/boolean.sql.out golden file) exercises both paths.
| return d.changePrecision(precision, scale) ? d : null; | ||
| } | ||
|
|
||
| // ----- string -> boolean (ANSI: throw on invalid syntax) ----- |
There was a problem hiding this comment.
Move stringToBooleanExact to UTF8StringUtils.scala — the peer string→primitive ANSI helpers (toByteExact / toShortExact / toIntExact / toLongExact) already live there. Renaming to toBooleanExact(UTF8String, QueryContext): Boolean for symmetry, and switching the codegen call site to UTF8StringUtils.getClass.getCanonicalName.stripSuffix("$") like the peers (e.g. castToByteCode at Cast.scala:2063), gives a fully consistent shape.
This also keeps the CastUtils.java class Javadoc accurate — its current "for ANSI overflow-checked casts" line and the per-row references[] rationale apply to the integral/fractional narrowing helpers but not to a string→boolean invalid-input check.
There was a problem hiding this comment.
Done in 83a3cbb112a:
- Added
UTF8StringUtils.toBooleanExact(UTF8String, QueryContext): Booleannext to the existingtoByteExact/toShortExact/toIntExact/toLongExactpeers. - Switched the eval and codegen call sites in
Cast.scala(castToBoolean/castToBooleanCode) to use it via thegetCanonicalName.stripSuffix("$")pattern, matchingcastToByteCodeetc. - Removed
stringToBooleanExactfromCastUtils.javaalong with itsUTF8String/StringUtilsimports, so the class Javadoc's "ANSI overflow-checked casts" + per-rowreferences[]framing stays accurate.
Updated the UTF8StringUtils class doc from "casting string to numeric values" to "casting string to primitive values under ANSI mode" so the new boolean helper is in scope.
lint-scala clean, 307/307 Cast tests pass.
…s.toBooleanExact, peer the other string ANSI helpers
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Is the PR description correct, @gengliangwang ? stringToBooleanExact in this PR?
Extend CastUtils.java with stringToBooleanExact(UTF8String, QueryContext) and use it from Cast.scala for the ANSI String -> Boolean cast path (both eval and codegen). The non-ANSI path keeps the inline if/else if/else evNull = true form because it has no error to throw.
|
Good catch, @dongjoon-hyun — the description was stale. The helper was moved/renamed in the follow-up commit ( |
dongjoon-hyun
left a comment
There was a problem hiding this comment.
+1, LGTM. Thank you, @gengliangwang and @cloud-fan .
|
@cloud-fan @dongjoon-hyun Thanks for the review. Merging this one to master/4.x |
### What changes were proposed in this pull request? Extend `UTF8StringUtils.scala` with `toBooleanExact(UTF8String, QueryContext)` and use it from `Cast.scala` for the ANSI `String -> Boolean` cast path (both eval and codegen), alongside the peer string ANSI helpers (`toByteExact` / `toShortExact` / `toIntExact` / `toLongExact`). The non-ANSI path keeps the inline `if/else if/else evNull = true` form because it has no error to throw. ### Why are the changes needed? Part of SPARK-56908 (umbrella). The ANSI String->Boolean cast emits an 8-line `if (isTrueString) … else if (isFalseString) … else throw` block in codegen. This PR collapses it to a one-line `UTF8StringUtils.toBooleanExact(...)` call. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? ``` build/sbt "catalyst/testOnly *CastSuite *CastWithAnsiOnSuite *CastWithAnsiOffSuite *AnsiCastSuite *TryCastSuite" ``` 307/307 pass. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Cursor 1.x Closes #55937 from gengliangwang/SPARK-56912-cast-boolean. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 634cda0) Signed-off-by: Gengliang Wang <gengliang@apache.org>
What changes were proposed in this pull request?
Extend
UTF8StringUtils.scalawithtoBooleanExact(UTF8String, QueryContext)and use it fromCast.scalafor the ANSIString -> Booleancast path (both eval and codegen), alongside the peer string ANSI helpers (toByteExact/toShortExact/toIntExact/toLongExact). The non-ANSI path keeps the inlineif/else if/else evNull = trueform because it has no error to throw.Why are the changes needed?
Part of SPARK-56908 (umbrella). The ANSI String->Boolean cast emits an 8-line
if (isTrueString) … else if (isFalseString) … else throwblock in codegen. This PR collapses it to a one-lineUTF8StringUtils.toBooleanExact(...)call.Does this PR introduce any user-facing change?
No.
How was this patch tested?
307/307 pass.
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Cursor 1.x